This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted over the past six months, starting on June 19, 2021. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.
In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every week we combine the most recent forecasts from each team into a single “ensemble” forecast for each target. This is used as the official ensemble forecast of the CDC, typically appearing on their forecasting website on Wednesday. You can explore the full set of models, including their forecasts for past weeks, online at the Forecast Hub interactive visualization. Other related resources include CMU Delphi’s forecast evaluation dashboard, a separate product of the Forecast Evaluation Research Collaborative, as well as the preprint Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US.
This report evaluates forecasts at the state and national level for newly reported weekly cases, deaths and hospitalizations due to COVID-19. Data from the JHU CSSE dashboard is used as ground truth data for evaluating the forecasts.
Starting on September 6, 2021, COVIDhub-ensemble only reported two-week-ahead forecasts for cases, due to persistent large inaccuracies observed when forecasting beyond that horizon. As of September 28, 2021, COVIDhub-ensemble only reports one-week-ahead forecasts for cases and 14-day-ahead forecasts for hospitalizations. For a more complete explanation, see our blog post here.
To reduce duplication of results, the COVIDhub_CDC-ensemble and COVIDhub-ensemble are omitted from this evaluation. The COVIDhub_CDC-ensemble pulls a subset of case and hospitalization forecasts from the COVIDhub-4_week_ensemble and death forecasts from the COVIDhub-trained_ensemble, and the COVIDhub-ensemble nearly matches the COVIDhub-4_week_ensemble and COVIDhub-trained_ensemble predictions for those targets, up to occasional small differences in the included models. As a result, the performance of the COVIDhub_CDC-ensemble and COVIDhub-ensemble models matches or nearly matches that of the COVIDhub-4_week_ensemble and COVIDhub-trained_ensemble on those targets. For more information about COVID-19 Forecast Hub ensemble methods, see this page.
We evaluate models based on their adjusted relative weighted interval scores (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks. To account for variation in the difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the relative adjusted WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
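To illustrate the pairwise approach, here is a minimal sketch (not the Hub's actual scoring code; the model names and WIS values are hypothetical). For each pair of models we take the ratio of their mean WIS over the forecast units both models submitted, summarize each model by the geometric mean of its ratios, and then rescale so the baseline sits at 1:

```python
from math import prod

# Toy WIS scores: scores[model][(location, week)] = WIS (hypothetical values)
scores = {
    "baseline": {("US", 1): 10.0, ("US", 2): 12.0, ("CA", 1): 8.0},
    "model_A":  {("US", 1): 7.0,  ("US", 2): 9.0,  ("CA", 1): 6.0},
    "model_B":  {("US", 1): 11.0, ("CA", 1): 9.0},  # missed one forecast
}

def pairwise_ratio(a, b):
    """Mean WIS of model a divided by mean WIS of model b,
    restricted to forecast units that both models covered."""
    shared = scores[a].keys() & scores[b].keys()
    if not shared:
        return None
    return (sum(scores[a][u] for u in shared) /
            sum(scores[b][u] for u in shared))

def relative_skill(m):
    """Geometric mean of model m's pairwise ratios against all models."""
    ratios = [r for other in scores
              if (r := pairwise_ratio(m, other)) is not None]
    return prod(ratios) ** (1 / len(ratios))

# Adjusted relative WIS: rescale so the baseline model scores exactly 1.0
base = relative_skill("baseline")
for m in scores:
    print(m, round(relative_skill(m) / base, 2))
```

Restricting each ratio to shared forecast units is what corrects for models that skipped hard weeks or locations; the geometric mean then combines the head-to-head comparisons symmetrically.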
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon.
The third and fourth tables evaluate recent/historical forecast models based on their prediction interval coverage at the 50% and 95% levels by horizon.
Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks.
Inclusion criteria for each column are detailed below the table.
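Prediction interval coverage is the simplest of these metrics: the fraction of observations that fall inside the corresponding interval, which for a well-calibrated model should be close to the nominal level (50% or 95%). A minimal sketch, with made-up observations and interval endpoints:

```python
def empirical_coverage(observations, lowers, uppers):
    """Fraction of observed values falling inside their prediction intervals.
    A well-calibrated 95% interval should achieve coverage near 0.95."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(observations, lowers, uppers))
    return hits / len(observations)

# Hypothetical observed values with 95% interval endpoints for each
print(empirical_coverage([5, 10, 15, 20], [4, 9, 16, 18], [6, 11, 17, 25]))
```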
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 10 weeks, since October 09, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS aggregated across horizons, with the most accurate models at the top.
The column titled, “# recent forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 10 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 10 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 10 week period by horizon.
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 26 weeks, since June 19, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered based on their relative WIS score aggregated across horizons.
The column titled, “# historical forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 26 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 26 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 26 week period by horizon.
This table only includes forecasts for the last 10 weeks, since October 09, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
This table only includes forecasts for the last 26 weeks, since June 19, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 10 weeks. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. These are the same inclusion criteria applied for WIS scores in the recent evaluation period.
The bars sum to the overall WIS. Note that these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
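For reference, the decomposition shown in the bars can be sketched as follows: the WIS is a weighted sum of interval scores, and each interval score splits into a dispersion term (interval width) plus penalties for observations falling above or below the interval. The quantile levels and values below are hypothetical:

```python
def wis(y, median, intervals):
    """Weighted interval score and its decomposition.

    intervals: list of (alpha, lower, upper), where (lower, upper) is the
    central (1 - alpha) prediction interval.
    Returns (wis, dispersion, underprediction, overprediction).
    """
    K = len(intervals)
    disp = under = over = 0.0
    # The median contributes |y - median| / 2, attributed to under- or
    # over-prediction depending on which side the observation falls.
    if y > median:
        under += 0.5 * (y - median)
    else:
        over += 0.5 * (median - y)
    for alpha, lo, hi in intervals:
        disp += (alpha / 2) * (hi - lo)       # width penalty
        under += max(y - hi, 0)               # observation above the interval
        over += max(lo - y, 0)                # observation below the interval
    denom = K + 0.5
    total = (disp + under + over) / denom
    return total, disp / denom, under / denom, over / denom

# Hypothetical forecast: median 10, 80% PI (8, 13), 50% PI (9, 11); observed 12
print(wis(12, 10, [(0.2, 8, 13), (0.5, 9, 11)]))
```

By construction the three components sum exactly to the WIS, which is why stacking them as bars reproduces the total score.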
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states for submission weeks beginning June 19, 2021 at a 1-week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4-week horizon.
In this figure, the dotted black line represents the average 1 week ahead error across all models. There is often larger error for the 4 week horizon compared to the 1 week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the 4 week horizon compared to the 1 week horizon.
The figures below show recent model performance stratified by location. We only included forecasts for the last 10 weeks. Models were included if they had submitted forecasts for all 4 horizons and submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative case counts.
The color scheme shows the WIS score relative to the baseline, across all horizons. The only locations evaluated are 50 states and a national level forecast. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
This figure shows the number of incident COVID-19 cases reported each week in the US. The period between the vertical blue line and the black line shows the weeks included in the “recent” model evaluations. The period between the vertical red line and the black line shows the weeks included in the “historical” model evaluations.
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon.
The third and fourth tables evaluate recent/historical forecast models based on their prediction interval coverage at the 50% and 95% levels by horizon.
Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
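The Tuesday-through-Monday aggregation can be sketched as follows (the dates and scores are hypothetical, and this is an illustration rather than the Hub's actual pipeline code):

```python
from datetime import date, timedelta

def weekly_mean_scores(daily_scores):
    """Average daily forecast scores into Tuesday-through-Monday weeks.

    daily_scores: dict mapping datetime.date -> score (e.g. WIS) for a
    daily hospitalization forecast. Returns dict mapping each week's
    Tuesday start date -> mean score for that week.
    """
    weeks = {}
    for d, s in daily_scores.items():
        # date.weekday(): Monday is 0, Tuesday is 1; shift each date back
        # to the Tuesday that starts its evaluation week.
        week_start = d - timedelta(days=(d.weekday() - 1) % 7)
        weeks.setdefault(week_start, []).append(s)
    return {w: sum(v) / len(v) for w, v in weeks.items()}

# Tue 2021-10-05 and Mon 2021-10-11 land in the same week;
# Tue 2021-10-12 starts the next one.
daily = {date(2021, 10, 5): 2.0, date(2021, 10, 11): 4.0, date(2021, 10, 12): 6.0}
print(weekly_mean_scores(daily))
```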
Inclusion criteria for each column are detailed below the table.
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 10 weeks, since October 09, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled, “# recent forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 10 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 10 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 10 week period by horizon.
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 26 weeks, since June 19, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled, “# historical forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 26 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 26 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 26 week period by horizon.
This table only includes forecasts for the last 10 weeks, since October 09, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
This table only includes forecasts for the last 26 weeks, since June 19, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 10 weeks. The models included have submitted at least 50% of forecasts during this time. These are the same inclusion criteria applied for WIS scores in the recent evaluation period.
The bars sum to the overall WIS. Note that these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states for submission weeks beginning June 19, 2021 at a 1-week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4-week horizon. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
In this figure, the dotted black line represents the average 1 week ahead error across all models. There is often larger error for the 4 week horizon compared to the 1 week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the 4 week horizon compared to the 1 week horizon.
The figures below show recent model performance stratified by location. We only included forecasts for the last 10 weeks. Models were included if they had submitted forecasts for all 4 horizons and submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.
The color scheme shows the WIS score relative to the baseline, across all horizons. The only locations evaluated are 50 states and a national level forecast. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
This figure shows the number of daily incident COVID-19 hospitalizations reported in the US. The period between the vertical blue line and the black line shows the weeks included in the “recent” model evaluations. The period between the vertical red line and the black line shows the weeks included in the “historical” model evaluations.
Due to technical issues with Ohio's reporting of deaths, the most recent 4 weeks of death data for Ohio are not included in this evaluation.
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon.
The third and fourth tables evaluate recent/historical forecast models based on their prediction interval coverage at the 50% and 95% levels by horizon.
Scores are aggregated separately for the most recent 10 weeks and for 26 historical weeks.
Inclusion criteria for each column are detailed below the table.
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 10 weeks, since October 09, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled, “# recent forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 10 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 10 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 10 week period by horizon.
To calculate each column in our table, different inclusion criteria were applied. This table only includes forecasts for the last 26 weeks, since June 19, 2021. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered based on their relative WIS score aggregated across horizons.
The column titled, “# historical forecasts” lists the number of forecasts a team has submitted with a target end date over the most recent 26 week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 26 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 26 week period by horizon.
This table only includes forecasts for the last 10 weeks, since October 09, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
This table only includes forecasts for the last 26 weeks, since June 19, 2021. For inclusion in this table, a model must have contributed forecasts for 5 or more weeks total during this period, or have submitted forecasts during at least 2 out of the last 3 evaluated weeks. These inclusion criteria were applied in order to score models that submitted for a substantial number of weeks at any point during the pandemic but may no longer be submitting, while also evaluating new teams that have recently joined our forecasting efforts. The data are initially ordered by model based on their 95% PI coverage, aggregated across horizons, with the most accurate models at the top.
The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 10 weeks. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. These are the same inclusion criteria applied for WIS scores in the recent evaluation period.
The bars sum to the overall WIS. Note that these values may not exactly match the relative WIS scores shown in the leaderboard table, because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states.
For the first 2 figures, WIS is used as the metric. The first figure shows the mean WIS across all 50 states for submission weeks beginning June 19, 2021 at a 1-week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4-week horizon.
In this figure, the dotted black line represents the average 1 week ahead error across all models. There is often larger error for the 4 week horizon compared to the 1 week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the 4 week horizon compared to the 1 week horizon.
The figures below show recent model performance stratified by location. We only included forecasts for the last 10 weeks. Models were included if they had submitted forecasts for all 4 horizons and submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative death counts.
The color scheme shows the WIS score relative to the baseline, across all horizons. The only locations evaluated are 50 states and a national level forecast. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
This plot shows the observed number of incident deaths over time in the US. The period between the vertical blue lines shows the weeks included in the “recent” model evaluations.